Kanji-to-Hiragana conversion based on a length-constrained n-gram analysis

Authors

  • Joseph Picone
  • Tom Staples
  • Kazuhiro Kondo
  • Nozomi Arai
Abstract

Tsukuba Research and Development Center, Texas Instruments, Tsukuba, Japan

A common problem in speech processing is the conversion of the written form of a language to a set of phonetic symbols representing the pronunciation. In this paper, we focus on an aspect of this problem specific to the Japanese language. Written Japanese consists of a mixture of three types of symbols: kanji, hiragana, and katakana. We describe an algorithm for converting conventional Japanese orthography to a hiragana-like symbol set that closely approximates the most common pronunciation of the text. The algorithm is based on two hypotheses: (1) the correct reading of a kanji character can be determined by examining a small number of adjacent characters; (2) the number of such combinations required in a dictionary is manageable. The algorithm converts the input text by selecting the most probable sequence of orthographic units (n-grams) that can be concatenated to form the input text. In closed-set testing, the n-gram algorithm performed better than several public domain algorithms, achieving a sentence error rate of 3% on a wide range of text material. Though the focus of this paper is written Japanese, the pattern matching algorithm described here has applications to similar problems in other languages.

EDICS: (1) SA 1.3.1; (2) SA 1.9

Please direct all correspondence to: Dr. Joseph Picone, Institute for Signal and Information Processing, Mississippi State University, Box 9571, Mississippi State, MS 39762, Tel: (601) 325-3149, Fax: (601) 325-2298, email: [email protected].
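The selection step in the abstract, choosing the most probable sequence of n-grams whose concatenation reproduces the input, can be illustrated with a small dynamic-programming sketch. This is not the authors' implementation: the abstract does not specify the search procedure, so the dynamic program, the function name `convert`, the `max_len` parameter, and the toy dictionary with its probabilities are all assumptions made for illustration.

```python
import math

# Hypothetical toy dictionary: n-gram -> (hiragana reading, probability).
# The entries and probabilities are invented for illustration only.
TOY_NGRAMS = {
    "日本語": ("にほんご", 0.6),
    "日本": ("にほん", 0.5),
    "日": ("ひ", 0.2),
    "本": ("ほん", 0.3),
    "語": ("ご", 0.4),
    "の": ("の", 0.9),
}


def convert(text, ngrams=TOY_NGRAMS, max_len=3):
    """Return the reading of the most probable n-gram segmentation of text.

    Only n-grams of at most max_len characters are considered (the length
    constraint). Returns None if the text cannot be covered by the dictionary.
    """
    n = len(text)
    # best[i] = (log-probability, start of last unit, reading of last unit)
    # for the best segmentation of text[:i].
    best = [(-math.inf, None, None)] * (n + 1)
    best[0] = (0.0, None, None)

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            unit = text[start:end]
            if unit not in ngrams or best[start][0] == -math.inf:
                continue
            reading, prob = ngrams[unit]
            score = best[start][0] + math.log(prob)
            if score > best[end][0]:
                best[end] = (score, start, reading)

    if best[n][0] == -math.inf:
        return None  # no sequence of dictionary n-grams covers the input

    # Walk the back-pointers to recover the readings in order.
    readings = []
    pos = n
    while pos > 0:
        _, prev, reading = best[pos]
        readings.append(reading)
        pos = prev
    return "".join(reversed(readings))


print(convert("日本語の本"))
```

With this toy dictionary, the three-character entry for 日本語 outscores the product of the shorter entries, so the longer context decides the reading of 日. Lowering `max_len` to 1 forces character-by-character lookup, which shows why the first hypothesis (adjacent characters determine the reading) matters.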

Similar resources

Kanji-to-Hiragana Conversion Based on a Length-Constrained N-Gram Analysis

A common problem in speech processing is the conversion of the written form of a language to a set of phonetic symbols representing the pronunciation. In this paper, we focus on an aspect of this problem specific to the Japanese language. Written Japanese consists of a mixture of three types of symbols: kanji, hiragana, and katakana. We describe an algorithm for converting conventional Japanese...

Kanji-to-hiragana conversion based on a language model

In speech recognition systems, a common problem is transcription of new additions to the recognition lexicon into their phonetic symbols. Specific to the Japanese language, such a problem can be dealt with in two steps. In this paper, we focus on the first step, in which the new lexical entry is converted into a set of hiragana syllabaries, which is almost a phonetic transcription. We propose a...

Inter- and Intrahemispheric Connectivity Differences When Reading Japanese Kanji and Hiragana

Unlike most languages that are written using a single script, Japanese uses multiple scripts including morphographic Kanji and syllabographic Hiragana and Katakana. Here, we used functional magnetic resonance imaging with dynamic causal modeling to investigate competing theories regarding the neural processing of Kanji and Hiragana during a visual lexical decision task. First, a bilateral model...

Investigating the mixture and subdivision of perceptual and conceptual processing in Japanese memory tests.

The dual nature of the Japanese writing system was used to investigate two assumptions of the processing view of memory transfer: (1) that both perceptual and conceptual processing can contribute to the same memory test (mixture assumption) and (2) that both can be broken into more specific processes (subdivision assumption). Supporting the mixture assumption, a word fragment completion test ba...

The role of interword spacing in reading Japanese: An eye movement study

The present study investigated the role of interword spacing in a naturally unspaced language, Japanese. Eye movements were registered of native Japanese readers reading pure Hiragana (syllabic) and mixed Kanji-Hiragana (ideographic and syllabic) text in spaced and unspaced conditions. Interword spacing facilitated both word identification and eye guidance when reading syllabic script, but not ...

Journal:
  • IEEE Trans. Speech and Audio Processing

Volume 7

Publication date: 1999